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Abstract 

Penalized likelihood methods are fundamental to ultra-high dimensional variable se- 
lection. How high dimensionality such methods can handle remains largely unknown. 
In this paper, we show that in the context of generalized linear models, such methods 
possess model selection consistency with oracle properties even for dimensionality of Non- 
Polynomial (NP) order of sample size, for a class of penalized likelihood approaches using 
folded-concave penalty functions, which were introduced to ameliorate the bias problems 
of convex penalty functions. This fills a long-standing gap in the literature where the 
dimensionality is allowed to grow slowly with the sample size. Our results are also ap- 
plicable to penalized likelihood with the Li-penalty, which is a convex function at the 
boundary of the class of folded-concave penalty functions under consideration. The co- 
ordinate optimization is implemented for finding the solution paths, whose performance 
is evaluated by a few simulation examples and the real data analysis. 

Running title: Non-Concave Penalized Likelihood 

Key words: Variable selection; High dimensionality; Non-concave penalized likelihood; 
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1 Introduction 

The analysis of data sets with the number of variables p comparable to or much larger 
than the sample size n frequently arises nowadays in many fields ranging from genomics and 
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health sciences to economics and machine learning. The data that we collect is usually of 
the type {yi where the y^'s are n independent observations of the response 

variable Y given its covariates, or explanatory variables, {xn, ■ ■ ■ ,Xip)'^. Generalized linear 
models (GLMs) provide a flexible parametric approach to estimating the covariate effects 
(McCullagh and Nelder, 1989). In this paper we consider the variable selection problem of 
Non-Polynomial (NP) dimensionality in the context of GLMs. By NP-dimensionality we 
mean that logp = 0{n°') for some a £ (0,1). See Fan and Lv (2009) for an overview of 
recent developments in high dimensional variable selection. 

We denote by X = (xi,--- ,Xp) the nx p design matrix with Xj = {xij,--- ,Xnj)'^, 



j = !,■■■ ,pandy = (yi. 



,yv 



the n-dimensional response vector. Throughout the paper 



we consider deterministic design matrix. With a canonical link, the conditional distribution of 
y given X belongs to the canonical exponential family, having the following density function 
with respect to some fixed measure 



n " f 

/n(y; X, /3) = foivi- Gi) = '[{\ c{yi) exp 

1=1 i=l ^ 



b{9i) 



(1) 



where (3 = (/5i,--- , (3p)'^ is an unknown p-dimensional vector of regression coefficients, 
{/o(y;^) : G R} is a family of distributions in the regular exponential family with dis- 
persion parameter G (0,00), and (^i,--- ,9n)^ = X/3. As is common in GLM, the 
function b{9) is implicitly assumed to be twice continuously differentiable with b"(9) al- 
ways positive. In the sparse modeling, we assume that majority of the true regression 
coefficients /3g = (/3o,i) ■ " ■ >/3o,p)"^ are exactly zero. Without loss of generality, assume that 
Pq = (/3^,/3^)"^ with each component of f3i nonzero and P2 — 0- Hereafter we refer to the 
support supp(/3o) = {I,-- - ,s} as the true underlying sparse model of the indices. Vari- 
able selection aims at locating those predictors Xj with nonzero /?o,j and giving an effective 
estimate of (3i. 

In view of ([T]), the log- likelihood log /„(y; X, /3) of the sample is given, up to an affine 
transformation, by 

n 



where h{0) 
likelihood 



4(/3) = n-i[y^X/3-l^b(X/3)], (2) 
, h{6n))'^ for = (^1, • • • , Onf' . We consider the following penalized 



g„(/3) = 4(/3) 



p 

E 



PA„(|/3,|), 



(3) 



where p\{-) is a penalty function and A„ > is a regularization parameter. 

In a pioneering paper. Fan and Li (2001) build the theoretical foundation of nonconcave 
penalized likelihood for variable selection. The penalty functions that they used are not any 
nonconvex functions, but really the folded-concave functions. For this reason, we will call 
them more precisely folded-concave penalties. The paper also introduces the oracle property 
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for model selection. An estimator (3 = (Pi , )^ ^^i^ to have the oracle property (Fan and 
Li, 2001) if it enjoys the model selection consistency in the sense of = with probability 
tending to 1 as n — ^ oo, and it attains an information bound mimicking that of the oracle 
estimator, where 0i is a subvector of /3 formed by its first s components and the oracle knew 
the true model supp(/3o) = {I,-"" i ahead of time. Fan and Li (2001) study the oracle 
properties of non-concave penalized likelihood estimators in the finite-dimensional setting. 
Their results were extended later by Fan and Peng (2004) to the setting of p = o(n^/^) or 
o(n^/^) in a general likelihood framework. The question of how large p can be so that similar 
oracle properties continue to hold arises naturally. Can the penalized likelihood methods be 
applicable to NP-dimensional variable selection problems? This paper gives an affirmative 
answer. 

Numerous efforts have lately been devoted to studying the properties of variable selection 
with ultra-high dimensionality and significant progress has been made. Meinshausen and 
Biihlmann (2006), Zhao and Yu (2006), and Zhang and Huang (2008) investigate the issue 
of model selection consistency for LASSO under different setups when the number of variables 
is of a greater order than the sample size. Candes and Tao (2007) introduce the Dantzig 
selector to handle the NP-dimensional variable selection problem, which was shown to behave 
similarly to Lasso by Biekel et al. (2009). Zhang (2009) is among the first to study the 
non-convex penalized least-squares estimator with NP-dimensionality and demonstrates its 
advantages over LASSO. He also develops the PLUS algorithm to find the solution path that 
has the desired sampling properties. Fan and Lv (2008) and Huang et al. (2008) introduce 
the independence screening procedure to reduce the dimensionality in the context of least- 
squares. The former establishes the sure screening property with NP-dimensionality and 
the latter also studies the bridge regression, a folded-concave penalty approach. Fan and 
Fan (2008) investigate the impact of dimensionality on ultra-high dimensional classification 
and establish an oracle property for features annealed independence rules. Lv and Fan 
(2009) make important connections between model selection and sparse recovery using folded- 
concave penalties and establish a nonasymptotic weak oracle property for the penalized least 
squares estimator with NP-dimensionality. There are also a number of important papers on 
establishing the oracle inequalities for penalized empirical risk minimization. For example, 
Bunea et al. (2007) establish sparsity oracle inequalities for the Lasso under quadratic loss 
in the context of least-squares; van de Geer (2008) obtains a nonasymptotic oracle inequality 
for the empirical risk minimizer with the Li-penalty in the context of GLMs; Koltchinskii 
(2008) proves oracle inequalities for penalized least squares with entropy penalization. 

The penalization methods are also widely used in covariance matrix estimation. This 
has been studied by a number of authors on the estimation of sparse covariance matrix, 
sparse precision matrix, and sparse Cholesky decomposition, using the Gaussian likelihood 
or pseudo-likelihood. See, for example, Huang et al. (2006), Meinshausen and Biihlmann 
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(2006), Levina et al. (2008), Rothman et al. (2008), and Lam and Fan (2009), among others. 
For these more specific models, stronger results can be obtained. 

The rest of the paper is organized as follows. In Section [21 we discuss the choice of 
penalty functions and characterize the non-concave penalized likelihood estimator and its 
global optimality. We study the nonasymptotic weak oracle properties and oracle properties 
of non-concave penalized likelihood estimator in Sections [3] and HI respectively. Section [5] 
discusses algorithms for solving regularization problems with concave penalties including the 
SCAD. In Section [6l we present three numerical examples using both simulated and real 
data sets. We provide some discussions of our results and their implications in Section [71 
Proofs are presented in Section [HI Technical details are relegated to the Appendix. 



2 Non-concave penalized likelihood estimation 

In this section we discuss the choice of penalty functions in regularization methods and 
characterize the non-concave penalized likelihood estimator as well as its global optimality. 

2.1 Penalty function 

For any penalty function px{-), we let p{t; X) = X~^px{t). For simplicity, we will drop its 
dependence on A and write p{t; A) as p{t) when there is no confusion. Many penalty functions 
have been proposed in the literature for regularization. For example, the best subset selection 
amounts to using the Lq penalty. The ridge regression uses the L2 penalty. The Lg penalty 
p{t) = P for q G (0, 2) bridges these two cases (Prank and Friedman, 1993). Breiman (1995) 
introduces the non-negative garrote for shrinkage estimation and variable selection. Lasso 
(Tibshirani, 1996) uses the Li-penahzed least squares. The SCAD penalty (Fan, 1997; Fan 
and Li, 2001) is the function whose derivative is given by 

p' (t) = X<I(t< X) + -^I (t> X)} , t>0, for some a > 2, (4) 

I (a-l)A J 

where often a = 3.7 is used, and MCP (Zhang, 2009) is defined through p^(t) = (aA — t)^ /a. 
Clearly the SCAD penalty takes off at the origin as the Li penalty and then levels off, and 
MCP translates the flat part of the derivative of SCAD to the origin. A family of folded 
concave penalties that bridge the Lq and Li penalties were studied by Lv and Fan (2009). 
Hereafter we consider penalty functions p\{-) that satisfy the following condition: 

Condition 1. p{t; A) is increasing and concave int £ [0, 00), and has a continuous derivative 
p'{t;X) with p'{0+;X) > 0. In addition, p'{t]X) is increasing in X £ (0, 00) and p'{0+;X) is 
independent of X. 

The above class of penalty functions has been considered by Lv and Fan (2009). Clearly 
the Li penalty is a convex function that falls at the boundary of the class of penalty functions 
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satisfying Condition [TJ Fan and Li (2001) advocate penalty functions that give estimators 
with three desired properties: unbiasedness, sparsity and continuity, and provide insights 
into them (see also Antoniadis and Fan, 2001). Both SCAD and MCP with a > 1 satisfy 
Condition [1] and the above three properties simultaneously. The Li penalty also satisfies 
Condition [T] as well as the sparsity and continuity, but it does not enjoy the unbiasedness, 
since its derivative is identically one on [0, oo) with the derivative at zero understood as the 
right derivative. However, our results are applicable to the Li-penalized regression. Con- 
dition [1] is needed for establishing the oracle properties of non-concave penalized likelihood 
estimator. 

2.2 Non-concave penalized likelihood estimator 

It is generally difficult to study the global maximizer of the penalized likelihood analytically 
without concavity. As is common in the literature, we study the behavior of local maximizers. 

We introduce some notation to simplify our presentation. For any 6 = {9i, - ■ ■ , On)^ G 
R", define 

^l{e) = {b'iei), • • • , b'{en)f and ^{O) = diag{6"(^i), • • • , 6"(^„)}- (5) 

It is known that the n-dimensional response vector y following the distribution in ([T|) has 
mean vector fJ,{0) and covariance matrix (j)Jl{6)^ where 6 = X/3. Let p{t) = sgn(i)/9'(|f|), 
t G R and p(v) = {p{vi), • • • , p{vq))'^ , v = (f i, • • • , Vq)^ , where sgn denotes the sign function. 
We denote by || • \\q the Lq norm of a vector or matrix for q G [0, oo]. Following Zhang (2009), 
define the local concavity of the penalty p at v = (wi, • • • , Vq)^ G R*^ with ||v||o = q as 

( \ V p'{t2)- p'{h) 

k{p;v) = lim max sup . (bj 

e^0+ l<j<qi^<j2e(|t)j|-e,|i>j|+e) *2 " h 

By the concavity of p in Condition [H we have k{p;v) > 0. It is easy to show by the mean- 
value theorem that ^(/J; v) = maxi<j<g —p"{\vj\) provided that the second derivative of p is 
continuous. For the SCAD penalty, k(/9;v) = unless some component of |v| takes values 
in [A,aA]. In the latter case, k(/9; v) = (a — 1)^^A^^. 

Throughout the paper, we use Amin(-) and Amax(-) to represent the smallest and largest 
eigenvalues of a symmetric matrix, respectively. 

The following theorem gives a sufficient condition on the strict local maximizer of the 
penalized likelihood QniP) in © (see Lv and Fan (2009) for the case of penalized least 
squares) . 

Theorem 1 (Characterization of PMLE). Assume that p\ satisfies Condition [I]. Then 
(3 G R^ is a strict local maximizer of the non-concave penalized likelihood Qn{(3) defined by 
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m if 

Xfy-Xf/x(0)-nA„p(3i) = 0, (7) 

||z||oo < p'(0+), (8) 

>nXnK{p;Pi), (9) 

where Xi and X2 respectively denote the submatrices of X formed by columns in supp(/3) 
and its complement, = X/3, f3i is a subvector of P formed by all nonzero components, and 
z = (nAn)~^X|'[y — n{0)]. On the other hand, if (5 is a local maximizer of Qn{P), then it 
must satisfy (Qj - (0j with strict inequalities replaced by nonstrict inequalities. 

There is only a tiny gap (nonstrict versus strict inequalities) between the necessary con- 
dition for local maximizer and sufficient condition for strict local maximizer. Conditions 
([7]) and ^ ensure that /3 is a strict local maximizer of ^ when constrained on the ||/3||o- 
dimensional subspace {(3 £ W : (3^ = 0} of R^, where (3^ denotes the subvector of /3 formed 
by components in the complement of supp(/3). Condition ^ makes sure that the sparse 
vector (3 is indeed a strict local maximizer of ([3]) on the whole space R^. 

When p is the Li penalty, the penalized likelihood function Qn{(3) in ^ is concave in 
(3. Then the classical convex optimization theory applies to show that (3 = • • • , Pp)'^ is 
a global maximizer if and only if there exists a subgradient z G dLi{(3) such that 

X^y - X^n{d) - nXnZ = 0, (10) 

that is, it satisfies the Karush-Kuhn- Tucker (KKT) conditions, where the subdifferential 
of the Li penalty is given by dLi{j3) = {z = (zi,--- ,^;p)^ G R^ : Zj = sgn(/?j) for fjj ^ 
and Zj G [—1, 1] otherwise}. Thus condition (fTO|) reduces to ([7]) and ([8]) with strict inequal- 
ity replaced by nonstrict inequality. Since k{p; v) = for the Li-penalty, condition ([9]) holds 
provided that Xf S;(0)Xi is nonsingular. However, to ensure that (3 is the strict maximizer 
we need the strict inequality in ([5]). 

2.3 Global optimality 

It is a natural question of when the non-concave penalized maximum likelihood estimator 
(NCPMLE) /3 is a global maximizer of the penalized likelihood Qn{f3). We characterize such 
a property from two perspectives. 

2.3.1 Global optimality 

Assume that the n x p design matrix X has full column rank p. This implies that p < n. 
Since b"{9) is always positive, it is easy to show that the Hessian matrix of —in{P) is always 
positive definite, which entails that the log-likelihood function inif3) is strictly concave in 



-^min Xf S ( ) Xi 



6 



j3. Thus there exists a unique maximizer (3^ of £niP)- Let Cc = {P £ : iniP) > c} be a 
sublevel set of —iniP) for some c < ^n(O) and 

ti<t2G(0,oo) t2-ti 

be the maximum concavity of the penalty function px. For the Li penalty, SCAD, and 
MCP, we have K,{px) = 0, (a — and a~^, respectively. The following proposition gives 

a sufficient condition on the global optimality of NCPMLE. 

Proposition 1 (Global optimality). Assume that X has rank p and satisfies 

min A^in [n-Vi: (X/3) X] > k{pxJ. (11) 

Then the NCPMLE P is a global maximizer of the penalized likelihood Qn{P) if P & C^c- 
Note that for penalized least-squares, pT]) reduces to 

A„,in(n-iX^X) > k(paJ. (12) 

This condition holds for sufficiently large a in SCAD and MCP, when the correlation between 
covariates is not too strong. The latter holds for design matrices constructed by using spline 
bases to approximate a nonparametric function. According to Proposition [H under ()12p . the 
penalized least-squares with folded-concave penalty is a global minimum. 

The proposition below gives a condition under which the penalty term in (l3|) does not 
change the global maximizer. It will be used to derive the condition under which the PMLE 
is the same as the oracle estimator in Proposition El^b). Here for simplicity we consider 
the SCAD penalty p\ given by (jH), and the technical arguments are applicable to other 
folded-concave penalties as well. 



Proposition 2 (Robustness). Assume that X has rank p with p = s and there exists 
some c < £n{0) such that min^^^ Amin['^ "'^X"^S(X/3)X] > cq for some cq > 0. Then 
the SCAD penalized likelihood estimator P is the global maximizer and equals P^ if P & 
and min^^^ > (a + ^)Xn, where P = {Pi,-- - , Pp)'^ . 

2.3.2 Restricted global optimality 

When p > n, it is hard to show the global optimality of a local maximizer. However, we 
can study the global optimality of the NCPMLE P on the union of coordinate subspaces. A 
subspace of R^ is called coordinate subspace if it is spanned by a subset of the natural basis 
{ei,--- ,ep}, where each ej is the p- vector with j-ih component 1 and elsewhere. Here 
each Bj corresponds to the j-th. predictor xj. We will investigate the global optimality of P 
on the union of all s-dimensional coordinate subspaces of R^ in Proposition [3^a). 
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Of particularly interest is to derive the conditions under which the PMLE is also an 
oracle estimator, in addition to possessing the above restricted global optimal estimator on 
S^. To this end, we introduce an identifiability condition on the true model supp(/3o). The 
true model is called ^-identifiable for some 5 > if 

max 4(/3) - sup £„(/3) > 6, (13) 
/3e^u PeSs\Ao 

where = {{Pi-, " " " , Pp)'^ S R-^ : Pj = for j ^ supp(/3o)}- In other words, supp(/3o) is 
the best subset of size s, with a margin at least 6. The following proposition is an easy 
consequence of Propositions [1] and [2j 

Proposition 3 (Global optimality on S^). 

a) If the conditions of Proposition[l\ are satisfied for each n x (2s) suhmatrix o/X, then 
the NCPMLE f3 is a global maximizer of Qn{f3) on Sj,. 

b) Assume that the conditions of Proposition\^are satisfied for the nx s submatrix ofK. 
formed by columns in supp(/3o)j the true model is 5-identifiable for some 5 > 

and supp(/3) = supp(/3o). Then the SCAD penalized likelihood estimator j3 is the global 
maximizer on and equals to the oracle maximum likelihood estimator . 

On the event that the PMLE estimator is the same as the oracle estimator, it possesses 
of course the oracle property. 



3 Nonasymptotic weak oracle properties 

In this section we study a nonasymptotic property of the non-concave penalized likelihood 
estimator /3, called the weak oracle property introduced by Lv and Fan (2009) in the setting of 
penalized least squares. The weak oracle property means sparsity in the sense of = with 
probability tending to 1 as n ^ oo, and consistency under the Loo loss, where f3 = (/3^ , )^ 
and (3i is a subvector of /3 formed by components in supp(/3o) = {1, • " " ) •s}- This property 
is weaker than the oracle property introduced by Fan and Li (2001). 

3.1 Regularity conditions 

As mentioned before, we condition on the design matrix X and use the px penalty in the 
class satisfying Condition [TJ Let Xi and X2 respectively be the submatrices of the n x p 
design matrix X = (xi, • • • ,Xp) formed by columns in supp(/3o) and its complement, and 
Oq = X/3o. To simplify the presentation, we assume without loss of generality that each 
covariate Xj has been standardized so that ||xj||2 = y/n. If the covariates have not been 
standardized, the results still hold with ||xj||2 assumed to be in the order of ^/n. Let 

d„ = 2-imin{|/3oj| :/3o,i/0} (14) 
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be half of the minimum signal. We make the following assumptions on the design matrix 
and the distribution of the response. 

Let {bg} be a diverging sequence of positive numbers that depends on the nonsparsity 
size s and hence depends on n. Recall that Pi is the non-vanishing components of the true 
parameter Pq. 



Condition 2. The design matrix X satisfies 
[xfS(0o)Xi]"' =0{bsn-^ 

oo 

Xi^I](0o)Xi [xf5](0o)Xi]"' 



<min<;c4^,0(n"i) 

oo I p'[dn) 



max max^_i Ar, 



[Xfdiag{|x,| o |/x"(Xi(5)|}Xi] =0(n) 



(15) 
(16) 
(17) 



where the Loo norm of a matrix is the maximum of the Li norm of each row, C E (0,1), 
cei E [0, 1/2], A/q = E R** : \\5 — Pi\\oo < dn}, the derivative is taken componentwise, and 
o denotes the Hadamard (componentwise) product. 



Here and below, p is associated with regularization parameter satisfying (|T8|) unless 
specified otherwise. For the classical Gaussian linear regression model, we have /x(0) = 6 
and 5](0) = In- In this case, since we will assume that s <C n, condition (fT5|) usually 
holds with 6s = 1 if the covariates are nearly uncorrelated. In fact, Wainwright (2009) 
shows that || [Xf Xi]^^ ||oo = Op{n~''^) if the rows of Xi are i.i.d. Gaussian vectors with 



[EM Xi] 



we can take 6, 



Op{n ). In general, since 



[XfXi] -""lloo < A/s/Amin(Xf Xi 



if A„,in(Xf Xi, 

[Xf S (0o) X 



-1 



0{n ^). More generally, (fT5]) can be bounded as 



d 



-1 



and the above remark for the multiple regression model applies to the submatrix Xi^5, which 
consists of rows of the samples with b"(9i) > d for some d > 0. 

The left hand side of (jl6p is the multiple regression coefficients of each unimportant 
variable in X2 on Xi, using the weighted least squares with weights {b"{6i)}. Condition ()16p 
controls the uniform growth rate of the Li-norm of these multiple regression coefficients, a 
notion of weak correlation between Xi and X2. If each element of the multiple regression 
coefficients is of order 0(1), then the Li norm is of order 0(s). Hence, we can handle the non- 
sparse dimensionality s = 0(n"i), by (jlOp . as long as the first term in (jlOp dominates, which 
occurs for SCAD type of penalty with d„ » A„. Of course, the actual dimensionality can be 
higher or lower, depending on the correlation between Xi and X2, but for finite non-sparse 
dimensionality s = 0(1), (fT6]) is usually satisfied. When a folded-concave penalty is used. 
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the upper bound on the right hand side of (|16p can grow to oo at rate 0(n°^). In contrast, 
when the Li penalty is used, the upper bound in (I16p is more restrictive, requiring uniformly 
less than 1. This condition is the same as the strong irrepresentable condition of Zhao and 
Yu (2006) for the consistency of the LASSO estimator, namely ||X|'Xi(X^Xi)~^||oo < C. 
It is a drawback of the Li penalty. 

For the Gaussian linear regression model, condition ()17p holds automatically. 

We now choose the regularization parameter A„ and introduce Condition [3l We will 
assume that half of the minimum signal d„ > n~'''logn for some 7 G (0,1/2]. Take A„ 
satisfying 

p'xMn) = o{b-^n-nogn) and A„ > n-°(logn)2, (18) 
where a = min(^, 27 — ao) — ai and hg is associated with the nonsparsity size s = 0(n"°). 

Condition 3. Assume that dn > n^'^'logn and bg = o{mm{n^^'^^"' ^ylog n, 

s~^n'^ / log n)}. In addition, assume that A„ satisfies I118\) and A„ko = o{tq), where kq = 

tq — min^g^^^ Amin[^ "'^X^S(Xi(5)Xi], and that YD.a.x^_-^ ||xj||oo — oiri'^ j -y/log n) 
if the responses are unbounded. 

The condition that A„ko = o{tq), is needed to ensure condition The condition always 
holds when kq = and is satisfied for the SCAD type of penalty when (i„ ^ A„. 

In view of d?]) and to study the non-concave penalized likelihood estimator /3 we need 
to analyze the deviation of the p-dimensional random vector X"^Y from its mean X"^/x(0o)) 
where Y = (Yi, • • • ,1^)"^ denotes the n-dimensional random response vector in the GLM ([T|). 
The following proposition, whose proof is given in Section 18.51 characterizes such deviation 
for the case of bounded responses and the case of unbounded responses satisfying a moment 
condition, respectively. 

Proposition 4 (Deviation). Let Y = (Yi, • • • , Y^)'^ be the n-dimensional independent ran- 
dom response vector and a G R". Then 

a) IfYi,--- ,Yn are bounded in [c, d] for some c, d G R, then for any e G (0, 00), 

2e2 



P( a' Y-aV(^o) > < 2exp 



id 



(19) 



b) // Yi, • • • ,Yn are unbounded and there exist some M, vq G (0, 00) such that 



with (^0,: 



max E <j exp 



\Yi - b' {00,^)1 
M 



\Yi - b' {eo,i)\ 



M 



- 2 



6q, then for any e G (0, cxd), 



P(\a^Y -a^^l{Go)\ > e) < 2exp 



|a||2fo + ||a|| 



Me 



(20) 



(211 
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In light of ([T]) , it is known that for the exponential family, the moment-generating function 
of Yi is given by 

^exp {t [y, - b' (00,*)] } = exp {0-1 [b {00,, + t<P)-b {Bo,,) - b' (^o,*) t4>] } , 

where 0o,i + tfp is in the domain of b{-). Thus the moment condition (|20p is reasonable. It 
is easy to show that condition (j20p holds for the Gaussian linear regression model and for 
the Poisson regression model with bounded mean responses. Similar probability bounds also 
hold for sub-Gaussian errors. 

We now express the results in Proposition H] in a unified form. For the case of bounded 
responses, we define ip{e) = 2e~'^'^^^ for e G (0, oo), where ci = 2/{d — c)^. For the case of 

2 

unbounded responses satisfying the moment condition (j20p . we define ^{s) = 1e~'^^^ , where 
ci = l/(2t;o + 2M). Then the exponential bounds in ([19]) and ([2T|) can be expressed as 

P(|a^Y-aV(^o)| > llallae) < (^(e), (22) 

where e S (0,oo) if the responses are bounded and e G (0, ||a||2/||a||oo] if the responses are 
unbounded. 

3.2 Weak oracle properties 

Theorem 2 (Weak oracle property). Assume that Conditions{I\{3\ and the probability bound 
(2^) are satisfied, s = o{n), and \ogp = 0{n^~'^°'). Then there exists a non-concave penalized 
likelihood estimator (5 such that for sufficiently large n, with probability at least 1 — 2[sn^'^ + 
(p_ 5)e-^-^'^i°g"], 3 = satisfies: 

a) (Sparsity). fi^ = 0/ 

b) (Loo loss). ||3i - /3i||oo = 0(?^"^logn), 

where (3i and (3i are respectively the subvectors of (3 and (Bq formed by components in 
supp(/3o)- 

Under the given regularity conditions, the dimensionality p is allowed to grow up to 
exponentially fast with the sample size n. The growth rate of logp is controlled by 1 — 2a. 
It also enters the nonasymptotic probability bound. This probability tends to 1 under our 
technical assumptions. From the proof of Theorem[2l we see that with asymptotic probability 
one, the Lqo estimation loss of the non-concave penalized likelihood estimator (3 is bounded 
from above by three terms (see (jl5])), where the second term bgXnp' (dn) / p' (0+) is associated 
with the penalty function p. For the Li penalty, the ratio p' (dn) / p' (0+) is equal to one, 
and for other concave penalties, it can be (much) smaller than one. This is in line with the 
fact shown by Fan and Li (2001) that concave penalties can reduce the biases of estimates. 
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Under the specific setting of penalized least squares, the above weak oracle property is slightly 
different from that of Lv and Fan (2009). 

The value of 7 can be taken as large as 1/2 for concave penalties. In this case, the 
dimensionality that the penalized least-squares can handle is as high as logp = 0(n^"i) 
when Oo < 1/2, which is usually smaller than that for the case 

of 7 < i + ^a. The large 

value of 7 puts more stringent condition on the design matrix. To see this. Condition [3] 



entails that bg = o(^logn) and hence (fTSj) becomes tighter. 

In the classical setting of 7 = 1/2, the consistency rate of (3 under the L2 norm becomes 
Op{^/sn~^^'^logn), which is slightly slower than Op{y/sn~^^'^). This is because it is derived 
by using the Loo loss of (3 in Theorem [2)3). The use of the Loo norm is due to the technical 
difficulty of proving the existence of a solution to the nonlinear equation ([Tj) . 

3.3 Sampling properties of Li-based PMLE 

When the Li-penalty is applied, the penalized likelihood QniP) in & is concave. The local 
maximizer in Theorems [T] and [2] becomes the global maximizer. Due to its popularity, we 
now examine the implications of Theorem [2] in the context of penalized least-squares and 
penalized likelihood. 

For the penalized least-squares. Condition [2] becomes 

(XfXi)"' =0(6,n-i), (23) 
X^Xi(xfXi)~^ <C<1. (24) 

OD 

Condition (fT7|) holds automatically and Condition ([TH]) becomes 

Xn = o{bJ^n~'^ log n) and A„ > n""(logn)^. (25) 
As a corollary of Theorem [2l we have 

Corollary 1 (Penalized Li estimator). Under Conditions and and probability bound 
[^) . if s = o{n) and \ogp = 0{n^~'^'^), then the penalized Li likelihood estimator (5 has 
model selection consistency with rate \\j3i — fii\\oo = 0{n~'^\ogn). 

For the penalized least-squares. Corollary [1] continues to hold without normality assump- 
tion, as long as probability bound (|22p holds. In this case, the result is stronger than that 
of Zhao and Yu (2006) and Lv and Fan (2009). 

4 Oracle properties 

In this section we study the oracle property (Fan and Li, 2001) of the non-concave penalized 
likelihood estimator j3. We assume that the nonsparsity size s <^ n and the dimensionality 
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satisfies logp = 0{n") for some a G (0, 1/2), which is related to the notation in Section [3l 
We impose the following regularity conditions. 

Condition 4. The design matrix X satisfies 

min Amin [Xf S (Xi^) Xi] > cn, tr[Xf S(0o)Xi] = 0{sn), (26) 

||xi^S(0o)Xi||2_^ = 0(n), (27) 
max max^^-^ Amax [xfdiag{|xj| o (Xi^) I } Xi] = 0{n), (28) 

where A/q = {5 G R'* : \\S — Pi\\oo < dn}, c is some positive constant, and ||B||2,oo = 
max||v||2=i ||Bv||oo. 

Condition 5. Assume thatdn » A„ » max{(s/n)^/^, n("~^)/^(log n)^/^}, p'x^{dn) = 0(n"^/^), 
and XnKQ = o(l), where kq = maxg^j^^^K{p;S), and in addition that max^^j^ ||xj ||oo = 
o(n^2~ / ^logn) if the responses are unbounded. 

Condition m is generally stronger than Condition [2l In fact, by d„ ^ A^ in Condition [5l 
the first condition in (I16p holds automatically for SCAD type of penalties, since p^^((i„) = 
when n is large enough. Thus Condition [5] is less restrictive for SCAD-like penalties, since 
kq = for sufficiently large n. 

However, for the Li penalty, A„ = p'\^{dn) = Oin^^/"^) is incompatible with A„ ^ 
(s/n)^/^. This suggests that the Li penalized likelihood estimator generally cannot achieve 
the consistency rate of Op{y/sn^'^^'^) established in Theorem [3] and does not have the oracle 
property established in Theorem U when the dimensionality p is diverging with the sample 
size n. In fact, this problem was observed by Fan and Li (2001) and proved by Zou (2006) 
even for finite p. It still persists with growing dimensionality. 

We now state the existence of the NCPMLE and its rate of convergence. It improves the 
rate results given by Theorem [2j 

Theorem 3 (Existence of non-concave penalized likelihood estimator). Assume that Condi- 
tions\^ M and\^ and the probability bound (2^ hold. Then there exists a strict local maximizer 
/3 = {Pi ,/32 )^ of the penalized likelihood Qn{P) such that = with probability tending 
to 1 as n ^ oo and \\(3 — /3o||2 = Op{y/sn~^/'^), where f3i is a subvector of P formed by 
components in supp(/3o). 

Theorem [3] can be thought of as answering the question that given the dimensionality, 
how strong the minimum signal dn should be in order for the penalized likelihood estimator 
to have some nice properties, through Conditions [Hand El On the other hand, Theorem[2]can 
be thought of as answering the question that given the strength of the minimum signal dn, 
how high dimensionality the penalized likelihood methods can handle, through Conditions [2] 
andO While the details are different, these conditions are related. 
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To establish the asymptotic normahty, we need additional condition, which is related to 
the Lyapunov condition. 

Condition 6. Assume that p'^^{dn) = o{s'^/^n~^^^), maxf^-^^E\Yi - 6'(6'o,j)|^ = 0(1), and 
^"^-^(z^B~"^Zj)'^/^ ^0 as n ^ oo, where (Yi, • • • ,1^)"^ denotes the n- dimensional random 
response vector, (6*0,1, • • • , Oo^n)'^ = Oq, = X^S(0o)Xi, and Xi = (zi, • • • , z„)-^. 

Theorem 4 (Oracle property). Under the conditions of Theorem {3^ if Condition holds 
and s = o(n^/^), then with probability tending to 1 as n ^ 00, the non-concave penalized 
likelihood estimator (3 = (/3^ ,(32 Theorem\^ must satisfy: 

a) (Sparsity). '^2 = 0/ 

b) (Asymptotic normality). 

A„ [Xf S (0o) Xi] (3i - ^ iV(0, (^G), 

where A„ is a q x s matrix such that A„A^ ^G, Gisagxg symmetric positive definite 
matrix, and Pi is a subvector of (3 formed by components in supp(/3o). 

From the proof of Theorem U we see that for the Gaussian linear regression model, the 
additional restriction of s = o(n^/^) can be relaxed, since the term in (j28|) vanishes in this 
case. 

5 Implementation 

In this section, we discuss algorithms for maximizing the penalized likelihood Qn{(3) in ([3]) 
with concave penalties including the SCAD. Efficient algorithms for maximizing non-concave 
penalized likelihood include the LQA proposed by Fan and Li (2001) and LLA introduced 
by Zou and Li (2008). The coordinate optimization algorithm was used by Fu (1998) and 
Daubechies et al. (2004) for penalized least-squares with Lg-penalty. This algorithm can 
also be applied to optimize the group Lasso (Antoniadis and Fan, 2001; Yuan and Lin, 2006) 
as shown in Meier et al. (2008) and the penalized precision matrix estimation in Friedman 
et al. (2007). 

In this paper we employ a path-following algorithm, called the iterative coordinate ascent 
(ICA) algorithm. Coordinate optimization type algorithms are especially appealing for large 
scale problems with both n and p large. It successively maximizes Qn{f3) for regularization 
parameter A in a decreasing order. ICA uses the Gauss-Seidel method, i.e., maximizing 
one coordinate at a time with successive displacements. Specifically, for each coordinate 
within each iteration, ICA uses the second order approximation of ini(3) at the p- vector from 
the previous step along that coordinate and maximizes the univariate penalized quadratic 
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approximation. It updates each coordinate if the maximizer of the corresponding univariate 
penaUzed quadratic approximation makes Qn{(3) strictly increase. Therefore, ICA algorithm 
enjoys the ascent property, i.e., the resulting sequence of Qn values is increasing for a fixed 
A. 

When (-niP) is quadratic in /3, e.g., for the Gaussian linear regression model, the second 
order approximation in ICA is exact at each step. For any 5 G and j G {!,••• we 
denote by ini(3', S,j) the second order approximation of iniP) at S along the j-th component, 
and 

_ _ p 

Qnifif,S,j)=iniP;S,j)-^Pxm\), (29) 

i=i 

where the subvector of /3 with components in {1, • • • ,p}\{j} is identical to that of d. Clearly 
maximizing Qn{-', is a univariate penalized least squares problem, which admits analyt- 
ical solution for many commonly used penalty functions. See the Appendix for formulae for 
three popular GLMs. 

Pick Amax G (0, oo) Sufficiently large such that the maximizer of Qnif3) with A = Amax is 
0, a decreasing sequence of regularization parameters {Ai, • • • , Xk} with Ai = Amax, and the 
number of iterations L. 

ICA ALGORITHM. 

1. Set k = 1 and initialize f3 ° = 0. 

2. Initialize f3 = (3 ^ , and set S = {1, • • • ,p} and 1 = 1. 

3. Successively for j e 5, let [ij be the maximizer of QniPj^P and update the j- 

— — — 

th component of f3 as f3j if the updated /3 strictly increases Q„(/3). Set S <— 
supp(3^'') U {j : \zj\ > p'(0+)} and £ ^ ^ + 1, where (zi, • • • , Zp)^ = (nAfc)"^X'^[y - 
A.(X3'')]. 

4. Repeat Step 3 until convergence or £ = L + 1. Set A; <— A; + 1. 

5. Repeat Steps 2-4 until k = K + 1. Return p-vectors (3 \ • • • ,/3 ^. 

— A^ 

When we decrease the regularization parameter from A^ to A^+i, using /3 as an initial 
value for (3 ''^^ can speed up the convergence. The set S is introduced in Step 3 to reduce 
the computational cost. It is optional to add {j : \zj\ > p'(0+)} to the set S in this step. In 
practice, we can set a small tolerance level for convergence. We can also set a level of sparsity 
for early stopping if desired models are only those with size up to a certain level. When the 
Li penalty is used, it is known that the choice of A = n~-^||X"^[y — /x(0)]||oo ensures that 
is the global maximizer of ([3]). In practice, we can use this value as a proxy for Amax- We 
give the formulas for three commonly used GLMs and the univariate SCAD penalized least 
squares solution in Sections lA.ll and IA.2I in the Appendix, respectively. 
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Figure 1: Boxplots of PE, L2 loss, and #S over 100 simulations for all methods in logistic 
regression, where p = 25. The x-axis represents different methods. Top panel is for BIC and 
bottom panel is for SIC. 



6 Numerical examples 
6.1 Logistic regression 

In this example, we demonstrate the performance of non-concave penalized likelihood meth- 
ods in logistic regression. The data were generated from the logistic regression model ([I]). 
We set {n,p) = (200,25) and chose the true regression coefficients vector (3q by setting 
f3i = (2.5, —1.9, 2.8, —2.2, 3)"^. The number of simulations was 100. For each simulated data 
set, the rows of X were sampled as i.i.d. copies from A^(0, Sq) with Sq = (0.5l*~-'l)ij=i^... ^p, 
and the response vector y was generated independently from the Bernoulli distribution with 
conditional success probability vector g'(X/3o), where g{x) = / {1 + e^). We compared Lasso 
(Li penalty), SCAD and MCP with the oracle estimator, all of which were implemented by 
the ICA algorithm to produce the solution paths. The regularization parameter A was se- 
lected by BIC and the semi-Bayesian information criterion (SIC) introduced by Lv and Liu 
(2008). 

Six performance measures were used to compare the methods. The first measure is the 
prediction error (PE) defined as E\Y — gi^S?" (j)^ , where 13 is the estimated coefficients vector 
by a method and (X"^, Y) is an independent test point. The second and third measures are 
the L2 loss 11/3 — /9q II 2 and Li loss ||/3 — /3o||i. The fourth measure is the deviance of the fitted 
model. The fifth measure, ^^S, is the number of selected variables in the final model by a 
method in a simulation. The sixth one, FN, measures the number of missed true variables 
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Table 1: Medians and robust standard deviations (in parentheses) of PE, L2 loss, Li loss, 
deviance, 7^8, and FN over 100 simulations for all methods in logistic regression by BIC and 
SIC, where p = 25 



Method 


Measures 


Lasso 


SCAD 


MCP 


Oracle 


BIC 


PE 


0.110(0.008) 


0.097(0.006) 


0.097(0.006) 


0.093(0.004) 




L2 loss 


3.055(0.656) 


0.943(0.550) 


0.943(0.550) 


0.880(0.339) 




Li loss 


7.247(1.095) 


1.867(1.461) 


1.867(1.461) 


1.732(0.767) 




Deviance 


129.36(19.20) 


111.82(15.80) 


111.82(15.80) 


113.12(16.05) 




#s 


9(2.97) 


5(0.74) 


5(0.74) 


5(0) 




FN 


0(0) 


0(0) 


0(0) 


0(0) 


SIC 


PE 


0.114(0.010) 


0.095(0.005) 


0.095(0.005) 


0.093(0.004) 




L2 loss 


3.342(0.600) 


0.943(0.476) 


0.943(0.476) 


0.880(0.339) 




Li loss 


7.646(1.114) 


1.799(1.006) 


1.799(1.006) 


1.732(0.767) 




Deviance 


134.93(18.35) 


112.22(16.30) 


112.22(16.30) 


113.12(16.05) 




#s 


9(2.22) 


5(0) 


5(0) 


5(0) 




FN 


0(0) 


0(0) 


0(0) 


0(0) 
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Figure 2: Boxplots of PE, L2 loss, and #S over 100 simulations for all methods in logistic 
regression, where p = 500 and 1000. The x-axis represents different methods. Top panel is 
for p = 500 and bottom panel is for p = 1000. 
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by a method in a simulation. 

In the calculation of PE, an independent test sample of size 10,000 was generated to 
compute the expectation. For both BIC and SIC, Lasso had median FN = with some 
nonzeros, and SCAD and MCP had FN = over 100 simulations. Table [J and Figure [1] 
summarize the comparison results given by PE, L2 loss, Li loss, deviance, #S, and FN, 
respectively for BIC and SIC. The Lasso selects larger model sizes than SCAD and MCP. 
Its associated median losses are also larger. 

We also examined the performance of non-concave penalized likelihood methods in high 
dimensional logistic regression. The setting of this simulation is the same as above, except 
that p = 500 and 1000. Since p is larger than n, the information criteria break down in the 
tuning of A due to the overfitting. Thus we used five-fold cross-validation based on prediction 
error to select the tuning parameter. Lasso had many nonzeros of FN, and SCAD and MCP 
had FN = over almost all 100 simulations except very few nonzeros. Table [2] and Figure [5] 
report the comparison results given by PE, L2 loss, Li loss, deviance, #S, and FN. 

It is clear from Table [2] that LASSO selects far larger model size than SCAD and MCP. 
This is due to the bias of the Li penalty. The larger bias in LASSO forces the cross-validation 
to choose a smaller value of A to reduce its contribution to PE. But, a smaller value of A 
allows more false positive variables to be selected. The problem is certainly less severe for 
the SCAD penalty and MCP. The performance between SCAD and MCP is comparable, as 
expected. 

6.2 Poisson regression 

In this example, we demonstrate the performance of non-concave penalized likelihood meth- 
ods in Poisson regression. The data were generated from the Poisson regression model ([T|). 
The setting of this example is similar to that in Section [6.11 We set (n,p) = (200,25) and 
chose the true regression coefficients vector /3q by setting (Bi = (1.25, —0.95, 0.9, —1.1, 0.6)-^. 
For each simulated data set, the response vector y was generated independently from the 
Poisson distribution with conditional mean vector exp(X/3Q). The regularization parameter 
A was selected by BIC (SIC performed similarly to BIC). 

The PE is defined as E\Y — exp(X"^/3)]^, where (3 is the estimated coefficients vector by 
a method and (X^,y) is an independent test point. Lasso, SCAD and MCP had FN = 
over 100 simulations. Table [3] summarizes the comparison results given by PE, L2 loss, Li 
loss, deviance, #S, and FN. 

We also examined the performance of non-concave penalized likelihood methods in high 
dimensional Poisson regression. The setting of this simulation is the same as above, except 
that p = 500 and 1000. The regularization parameter A was selected by BIC and five-fold 
cross-validation (CV) based on prediction error. For both BIC and CV, Lasso had median 
FN = with some nonzeros, and SCAD and MCP had FN = over 100 simulations. Table 
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Table 2: Medians and robust standard deviations (in parentheses) of PE, L2 loss, Li loss, 
deviance, #S, and FN over 100 simulations for all methods in logistic regression, where 
p = 500 and 1000 



p 


Measures 


Lasso 


Q.n A F> 




Oracle 


ouu 


rtij 


u.ui4y(^u.uio ) 


n c\QK(c\ c\c\P.\ 
u.uyo^^u.uuD J 


u.uyD^^u.uu / J 


u.uy4n^u.uuz ) 




L2 loss 






i.ioLn^u.yoo j 






Li loss 


11.540(0.841) 


2.508(2.044) 


2.481(2.292) 


1.591(0.939) 




Deviance 


113.84(43.76) 


100.22(16.03) 


102.96(15.36) 


108.06(17.33) 




#s 


41(20.39) 


9(3.71) 


6(1.48) 


5(0) 




FN 


0(0.74) 


0(0) 


0(0) 


0(0) 


1000 


PE 


0.163(0.010) 


0.096(0.020) 


0.096(0.007) 


0.093(0.003) 




L2 loss 


4.753(0.333) 


1.400(1.591) 


1.010(1.000) 


0.808(0.517) 




Li loss 


11.759(0.801) 


3.133(3.297) 


2.322(2.145) 


1.490(0.949) 




Deviance 


152.19(49.36) 


99.18(19.80) 


103.25(16.99) 


110.03(14.49) 




#S 


28.5(24.83) 


13(4.45) 


7(2.22) 


5(0) 




FN 


1(0.74) 


0(0) 


0(0) 


0(0) 



Table 3: Medians and robust standard deviations (in parentheses) of PE, L2 loss, Li loss, 
deviance, #S, and FN over 100 simulations for all methods in Poisson regression, where 
p = 2b 



Measures 


Lasso 


SCAD 


MCP 


Oracle 


PE 


7.195(2.428) 


4.081(0.826) 


4.012(0.791) 


3.688(0.574) 


L2 loss 


0.269(0.076) 


0.141(0.045) 


0.136(0.040) 


0.111(0.035) 


Li loss 


0.606(0.215) 


0.276(0.103) 


0.271(0.094) 


0.216(0.067) 


Deviance 


191.09(14.62) 


186.73(12.72) 


187.23(13.14) 


187.72(15.28) 


#s 


9(2.22) 


5(0.74) 


5(0.74) 


5(0) 


FN 


0(0) 


0(0) 


0(0) 


0(0) 
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Table 4: Medians and robust standard deviations (in parentheses) of PE, L2 loss, Li loss, 
deviance, #S, and FN over 100 simulations for all methods in Poisson regression by BIG and 
GV, where p = 500 and 1000 



p Method 


Measures 


Lasso 


SGAD 


MGP 


Oracle 


500 BIG 


PE 


26.989(11.339) 


4.820(1.772) 


4.672(1.593) 


3.479(0.738) 




1/2 loss 


0.790(0.206) 


0.199(0.074) 


0.178(0.076) 


0.104(0.043) 




Li loss 


2.446(0.638) 


0.424(0.165) 


0.371(0.161) 


0.184(0.083) 




Deviance 


202.65(22.23) 


187.15(16.41) 


189.66(17.24) 


189.30(21.73) 




#s 


29(5.93) 


9(3.34) 


7(2.22) 


5(0) 




FN 


0(0) 


0(0) 


0(0) 


0(0) 


GV 


PE 


24.200(9.636) 


4.542(1.554) 


4.272(1.503) 


3.479(0.738) 




L2 loss 


0.698(0.162) 


0.168(0.065) 


0.168(0.057) 


0.104(0.043) 




Li loss 


3.229(1.368) 


0.495(0.201) 


0.411(0.162) 


0.184(0.083) 




Deviance 


117.16(40.13) 


166.58(21.76) 


173.01(19.17) 


189.30(21.73) 




#s 


63.5(24.83) 


18(10.75) 


12.5(6.67) 


5(0) 




FN 


0(0) 


0(0) 


0(0) 


0(0) 


1000 BIG 


PE 


33.069(14.089) 


5.523(2.027) 


5.144(1.808) 


3.676(0.772) 




L2 loss 


0.971(0.209) 


0.210(0.094) 


0.187(0.088) 


0.108(0.047) 




Li loss 


2.990(0.689) 


0.485(0.232) 


0.443(0.198) 


0.197(0.090) 




Deviance 


199.99(22.89) 


180.34(13.07) 


181.21(15.31) 


187.98(17.22) 




#s 


34(7.41) 


11.5(4.08) 


9(2.22) 


5(0) 




FX 


0(0) 


0(0) 


0(0) 


0(0) 


GV 


PE 


31.701(16.571) 


4.821(1.732) 


4.700(1.702) 


3.676(0.772) 




L2 loss 


0.889(0.201) 


0.162(0.077) 


0.162(0.064) 


0.108(0.047) 




Li loss 


4.297(1.646) 


0.506(0.341) 


0.454(0.239) 


0.197(0.090) 




Deviance 


92.89(44.51) 


160.23(20.80) 


169.34(23.44) 


187.98(17.22) 




#s 


83(40.77) 


22(11.86) 


14(7.04) 


5(0) 




FN 


0(0) 


0(0) 


0(0) 


0(0) 
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Table 5: Classification errors in the neuroblastoma data set 



3-year EFS 



Gender 



Method # of genes Test error 



^ of genes Test error 



Lasso 56 23/114 

SCAD 10 18/114 

MCP 7 23/114 

SIS 5 19/114 

ISIS 23 22/114 



4 5/126 

2 4/126 

1 12/126 
6 4/126 

2 4/126 



m reports the comparison results given by PE, L2 loss, Li loss, deviance, #S, and FN. 
6.3 Real data analysis 

In this example, we apply non-concave penalized likelihood methods to the neuroblastoma 
data set, which was studied by Oberthuer et al. (2006). This data set, obtained via the 
MicroArray Quality Control phase-II (MAQC-II) project, consists of gene expression profiles 
for 10,707 genes from 251 patients of the German Neuroblastoma Trials NB90-NB2004, 
diagnosed between 1989 and 2004. The patients at diagnosis were aged from to 296 
months with a median age of 15 months. The study aimed to develop a gene expression- 
based classifier for neuroblastoma patients that can reliably predict courses of the disease. 

We analyzed this data set for two binary responses: 3-year event-free survival (3-year 
EFS) and gender, where 3-year EFS indicates whether a patient survived 3 years after the 
diagnosis of neuroblastoma. There are 246 subjects with 101 females and 145 males, and 
239 of them have the 3-year EFS information available (49 positives and 190 negatives). 
We applied Lasso, SCAD and MCP using the logistic regression model. Five-fold cross- 
validation was used to select the tuning parameter. For the 3-year EFS classification, we 
randomly selected 125 subjects (25 positives and 100 negatives) as the training set and the 
rest as the test set. For the gender classification, we randomly chose 120 subjects (50 females 
and 70 males) as the training set and the rest as the test set. Table [5] reports the classification 
results of all methods, as well as those of SIS and ISIS, which were extracted from Fan et al. 
(2009). Tables El and [3 fist the selected genes by Lasso, SCAD and MCP for the 3-year EFS 
classification and gender classification, respectively. 

7 Discussions 

We have studied penalized likelihood methods for ultra-high dimensional variable selection. 
In the context of GLMs, we have shown that such methods have model selection consis- 
tency with oracle properties even for NP-dimensionality, for a class of non-concave penalized 
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Table 6: Selected genes for the 3-year EFS classification 



Gene 


Lasso 


bCAD 




Gene 


Lasso 


SGAD 


MGP 


A_24_P182182 




X 




Hs419768. 1 


X 






A_23_P144096 


X 






A_23_P313728 


X 






A_23_P124514 


X 






A_23_P12884 


X 






A_23_P502879 


X 






A_23_P130626 


X 






A_23_P71319 


X 






A_23_P143958 




X 




A_24_P73158 


X 


X 




Hsl55462. 1 




X 




A_24_P282251 


X 




X 


A_23_P209394 


X 






A_23_P125435 


X 






A_24_P100419 


X 






A_23_P80491 


X 






Hs379382 . 1 


X 






A_23_P77779 


X 






A_24_P271696 


X 






A_23_P36076 


X 


X 




TT f\f\A A rt"^ A 

Hs381187. 1 


X 






A_23_P35349 


X 






Hs265827 . 1 


X 






A_23_P208030 


X 






Hs269914.3 


X 






A_23_P72737 


X 




X 


Hs36034. 1 


X 






A_23_P25194 


X 






A_23_P83751 


X 






A_23_P200043 


X 






A_23_P96325 


X 






A_23_P422809 


X 






A_23_P97553 


X 






A_23_P1 10345 


X 




X 


A_24_P232158 


X 






A_23_P5131 


X 




X 


A_23_P9836 


X 






A_23_P11859 


X 






TT A A 

Hsl70298. 1 


X 




X 


A_23_P7376 


X 






r60_al35 


X 






A_23_P211985 


X 






A_23_P133956 


X 






A_24_P365954 




X 




A_32_P27511 


X 






A_23_P86975 


X 






A_23_P80626 


X 






A_2d_Foyy 10 


X 






A_j2_Flbo / Oo 


X 






A_24_P285055 


X 






A_23_P100764 




X 




A_23_P68547 


X 






Hs407755 . 1 


X 






A_23_P6252 


X 






Hs86643.1 


X 






A_23_P386356 


X 






Hs422789 . 1 


X 


X 




A_24_P50228 






X 


A_23_P131866 




X 




Hs37637.1 


X 






A_23_P147397 


X 






Hs457415.1 




X 


X 


A_23_P13852 


X 
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Table 7: Selected genes for the gender classificatfon 
Gene Lasso SCAD MCP 

A_23_P329835 x 
A_23_P259314 x 
A_23_P137238 x x x 

A_24_P500584 x x 



likelihood approaches. Our results are consistent with a known fact in the literature that 
concave penalties can reduce the bias problems of convex penalties. The convex function of 
Li-penalty falls at the boundary of the class of penalty functions under consideration. We 
have used the coordinate optimization to find the solution paths and illustrated the per- 
formance of non-concave penalized likelihood methods with numerical studies. Our results 
show that the coordinate optimization works equally well and efficiently for producing the 
entire solution paths for concave penalties. 

8 Proofs 

8.1 Proof of Theorem [1] 

We will first derive the necessary condition. In view of ([2]), we have 

V£n{(3) = n-^ [X^y - X^/i {0)] and V^4(/3) = -n^^X^S {6) X, (30) 

where 6 = X/3. It follows from the classical optimization theory that if /3 = • • • , /3p)"^ is 
a local maximizer of the penalized likelihood (l3|), it satisfies the Karush-Kuhn- Tucker (KKT) 
conditions, i.e., there exists some v = • • • , Vp)^ S such that 

X^y - X^/i(0) - nA„v = 0, (31) 

where 6 = x3, Vj = p0j) for / 0, and Vj £ [-p'(0+), p'(0+)] for /3j = 0. Let S = 
supp(/3). Note that f3 is also a local maximizer of ^ constrained on the ||/3||o-dimensional 
subspace B = {P £ HP ■.13^ = 0} of R^, where (3^ denotes the subvector of /3 formed by 
components in S'^, the complement of 5. It follows from the second order condition that 

>nA„K(p;3i), (32) 

where k{p; /3^) is given by ([6]). It is easy to see that equation (j3ip can be equivalently written 
as 

Xf y - Xjnid) - n\np0i) = 0, (33) 
||z||oc < p'(0+), (34) 



Xf S ( 6> ) Xi 
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where z = (nXn) ^X^[y — /Lt(0)] and X2 denotes the submatrix of X formed by columns in 
S". 

We now prove the sufficient condition. We first constrain the penahzed hkehhood ([3]) 
on the ||/3||o-dimensional subspace B of R^. It fohows from condition ([9]) that Qn{P) is 
strictly concave in a ball A/q in the subspace B centered at (3. This along with equation ([7]) 
immediately entails that /3, as a critical point of Qn{P) in B, is the unique maximizer of 
QniP) in the neighborhood A/q. 

It remains to prove that the sparse vector /3 is indeed a strict local maximizer of Qn{P) 
on the space R^. To show this, take a sufficiently small ball A/i in R^ centered at f3 such 
that A/i n ^ C TVo. We then need to show that Qn0) > <5n(7i) for any 7^ G M \ A/q. Let 
72 be the projection of onto the subspace B. Then we have 72 G A/q, which entails that 
QniP) > Qn{l2) if 72 7^ Z^) sincc j3 is the strict maximizer of Qn{/3) in A/q. Thus, it suffices 
to show that Qn (72) > Qn(7l)- 

By the mean-value theorem, we have 

Qnhl) - Qnil2) = V^Q„(7o)(7l - 72), (35) 

where 79 lies on the line segment joining 72 ^nd "fi- Note that the components of 7^ — 72 
are zero for the indices in S and the sign of 70 j is the same as that of 71 j for j ^ S, where 
7o,j and jij are the j-th components of 70 and 7^^, respectively. Therefore, the right hand 
side of ([35]) can be expressed as 

{n"^X^[y- /i(X7o)]}^7i,2 - >^n^ p{ho,j\)hi,j\^ (36) 

where 7^^ 2 is a subvector of -yi formed by the components in 5^. By 7^ G A/i \ A/q, we have 
7i,2 / 0. 

It follows from the concavity of p in Condition [1] that p'{t) is decreasing in t £ [0,oo). 
By condition ^ and the continuity of p'{t) and b'{9), there exists some (5 > such that for 
any /3 in a ball in R^ centered at (3 with radius 5, 

||(nA„)-ixl^ [y - /x(X/3)] lU < p{6). (37) 

We further shrink the radius of the ball A/i to less than 6 so that |7o,j| < \li,j\ < ^ ^or j ^ S 
and ([37|) holds for any (3 £ Mi. Since 79 G A/i, it follows from ([37|) that the term ([36]) is 
strictly less than 

AnP'(5)||7i,2lli-A„p'(5)||7i,2lli = 0, 

where the monotonicity of p'{-) was used in the second term. Thus we conclude that Qn{li) < 
Qn (72)- This completes the proof. 
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8.2 Proof of Proposition [T] 

Let dCc = {/9 G : iniP) = c} be the level set. By the concavity of in{l3), we can easily 
show that for c < ^n(O), Cc is a closed convex set with and being its interior points and 
the level set dCc is its boundary. We now show that the global maximizer of the penalized 
likelihood Qnif3) belongs to Cc- 

For any 7 G dCc, let T-y = {t^ : t G (1, 00)} be a ray. By the convexity of Cc, we have 
{tj : t £ [0, 1]} C Cc for 7 G dCc, which implies that 

U T^ = np\Cc. 

Thus to show that the global maximizer of Qn{l3) belongs to C^ it suffices to prove Qn{tj) < 
Qnil) for any t G (l,oo) and 7 G dCc- This follows easily from the definition of Qnif3), 
Lit-f) <c = Inil), and Ej=iPA„(t|7il) > Ej=iPA„(|7il), where 7 = (71, • • • ,jpf. 

It remains to prove that the local maximizer of Qni(3) in Cc must be a global maximizer. 
This is entailed by the concavity of QniP) on Cc, which is ensured by condition (jlip . This 
concludes the proof. 

8.3 Proof of Proposition [2] 

Since c < ^n(O), from the proof of Proposition [T] we know that the global maximizer of the 
penalized likelihood Qn{P) belongs to Cc- Note that by assumption, the SCAD penalized 
likelihood estimator f3 = {f3i, - - - G Cc and min^^^^ \l3j\ > aXn- It follows from ([3]) and 

([!]) that /3 is a critical point of in{P) and thus /3 = by the strict concavity of in{/3)- It 
remains to prove that (3^ is the maximizer of Qn{P) on Cc- 

The key idea is to use a first order Taylor expansion of in{P) around and retain the La- 
grange remainder term. This along with V£„(/3^) = and min^^^ Aniin['^^^X^5](X/3)X] > 
Co gives for any P £ Cc, 

QniP) < ^ in{(3,) " | 11/3 " PJI " EpA„(|/3,|), 

since is in the convex set Cc- Thus if is the global maximizer of Qn{P) on R^, then 
we have for any (3 £ Cc, 

This entails that (3^ is the global maximizer of Qn{(3). 

To maximize Qn{l3), we only need to maximize it componentwise. Let f3^ = (/3*,i, • • • , f3*,pY' - 
Then it remains to show that for each j = 1, - - - ,p, f3^j is the global minimizer of the uni- 
variate SCAD penalized least squares problem 

min5,(/5) = mm{| (/3 + PxM)} ■ (38) 
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This can easily been shown from the analytical solution to (|38p. For the sake of completeness, 
we give a simple proof here. 

Recall that we have shown that /3 = /3^. In view of ()38p and > aXn, for any 

\f3\ > aXn with (3 ^ (3^j, we have 

9j{P) > P\M) = PxA^K) = 9j{\P*,j\), 

where we used the fact that px„{-) is constant on [aXn, oo). Thus, it suffices to prove gj{(3) > 
P\„{aXn) on the interval \f3\ < aXn- For sucha/3, we have pA„(«A„)-pA„ (|/?|) < A„(aA„-|/?|). 
Thus we need to show that 

min l'^ {\f3^j\ - aXn + - Xnz\ > 0, 

2g[0,aA„] L z J 

which always holds as long as > (a + ^)Xn and thus completes the proof. 

8.4 Proof of Proposition [3] 

Let A be any s-dimensional coordinate subspace different from = • • • ,/3p)^ G : 
Pj = for j ^ supp(/3)}. Clearly © ^ is a d-dimensional coordinate subspace with 
d < 2s. Then part a) follows easily from the assumptions and Proposition [TJ Part b) is an 
easy consequence of Proposition [2] in view of the assumptions and the fact that 

/.N r ^ \ + l)-^n 

max px„{t) =px„{aXn) = 

ie[0,oo) 2 

for the SCAD penalty p\ given by (j4]). 

8.5 Proof of Proposition [4] 

Part a) follows easily from a simple application of Hoeffding's inequality (Hoeffding, 1963), 
since oiYi, • • • , a„y„ are n independent bounded random variables, where a = (oi, • • • , a„)"^. 
We now prove part b). In view of condition ([20]l . Ojli — aj6'(0o,j) are n independent random 
variables with mean zero and satisfy 

E \aiY, - aib' (^o,.)!" = l^^E \Y, - b' (^o,.)!" < larmlM"^-^^ 

Thus an application of Bernstein's inequality (see, e.g., Bennett, 1962 or van der Vaart and 
Wellner, 1996) yields 



P(|a'^Y- a'^/x(0o)| > < 2exp 

= 2 exp 



1 £2 



2Er=i«^o + ||a|LMe 
1 

2 ||a||2 + ||a||ooMe 



which concludes the proof. 
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8.6 Proof of Theorem [2] 

We break the whole proof into several steps. Let Xi and X2 respectively be the submatrices 
of X formed by columns in TIq = supp(/3o) and its complement TIq, and 6q = 'K(3q. Let 
^ = (^1, • • • , S.p)'^ = X-^y — X-^/^(0o)- Consider events 

^1 = {ll^artolL ^ q^/V'^log?^} and ^2 = { ^gjtg ^ < '"n\/ra| , 

where u„ = ^'^■^n^/^~"(logn)-'^/^ is a diverging sequence and denotes a subvector of v 
consisting of elements in A. Since ||xj||2 = \/n, it follows from Bonferroni's inequality and 
(f22D that 

P{£in£2) (39) 

> 1 - ^ P (le^l > C-'/'^ATb^) (101 > UnV^) 

> 1 - 2 \sn-^ + (p-s) e-"^''" 



j^sn ^ + {p 
1 — 2[sn~^ + (p — s)e 



1 I r„ „\„-nl-2" Wni 



where s = ||/9ollo and Un < v^/ niax^^j^ ll^illoo for unbounded responses, which is guaranteed 
for sufficiently large n by Condition [3l Under the event £1(182, we will show that there exists 
a solution /3 G R*^ to dZ])-® with sgn(/3) = sgn(/3g) and \\f3 — /3o||oo = 0(n"''' logn), where 
the function sgn is applied componentwise. 

Step 1: Existence of a solution to equation We first prove that for sufficiently large 
n, equation ^ has a solution (3^ inside the hypercube 

M={5e'Rf : ||(5-/3i||oo = n-'^logn}. 

For any 6 = {5i, - ■ ■ , 5sY' G AA, since (i„ > logn, we have 

min^^i \6j\ > min^gOT;, \(3oj\ - dn = dn (40) 

and sgn(d) = sgn(/3i). Let r] = nXnp{5). Using the monotonicity condition of p'{t), by (PU|) 
we have 

< nXnp'{dn), 
which along with the definition of £1 entails 

II^OTo ~ '^lU < q ^^^v^nlogn + n\np'{dn). (41) 

Define vector-valued functions 

7(5) = (7i(5),--- ,7p(^))^ = xV(Xi6), 5gR^ 
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and 

Then, equation ^ is equivalent to ^{6) = 0. We need to show that the latter has a solution 
inside the hypercube M. To this end, we represent 'y^{S) by using a second order Taylor 
expansion around with the Lagrange remainder term componentwise and obtain 



where r = (ri, 



Im^iS) = 7an„(/3i) + Xf S (Oo) Xi(5 - (3,) + r, 
)^ and for each j = 1, • • • , s. 



(42) 



with dj some s-vector lying on the line segment joining S and /3^. By (jl7p . we have 



r||^ < max max-Amax [xf diag {|xj| o (Xi5o)|} Xi] \\S - f3 



2 
lll2 



(43) 



O [sn 



1-27 



(log n) 



Let 



*(5) = [Xf S (0o) Xi] ' *(5) = 5 - /3i + u, 



(44) 



where u = -[^\j:{GQ)^i]-^{$,^-r]-v). It follows from dH]), (03]), and ((E]) in Condition 
[2] that for any S e Af, 



Moo< 



[Xfl](0o)Xi 



, -1 



l^mo ~ ■'^lloo + l|r||oo) 



(45) 



O 



bsu ^/'^^/\ogn + bs\np'{dn) +bssn ^'^(logn)^ 



By Condition [31 the first and third terms are of order o{n logn) and so is the second term 
by m]). This shows that 



o{n '^logn). 



By ()44p . for sufficiently large n, if {8 — /3i)j = n we have 



>0, 



and if [5 — 



-n 



'yTogn, we have 



< -n-^yb^+ ||u||^ < 0, 



(46) 



(47) 



where ^{5) = (^'i(5), • • • , ^ si^))^ ■ By the continuity of the vector-valued function ^{6), 
dM]) and ([^7]) . an application of Miranda's existence theorem (see, e.g., Vrahatis, 1989) shows 
that equation ^{8) = has a solution f3i in M . Clearly (3i also solves equation ^{8) = 
in view of (jS]). Thus we have shown that equation ([7]) indeed has a solution /3i in M. 
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Step 2: Verification of condition Let (3 G with /Sgjjjj = Pi G a solution to 
equation ([7]) and /Sgrjjc = 0, and = X/3. We now show that /3 satisfies inequaUty ([8]) for A„ 
given by ([H]). Note that 



(nA„)-i { [X^y - X^/x (0o)] - [xi^M (^) " X^^/x (0o)] } 
(nA„)"^ l^gjjc - 7ane(3i) - 7aTi§(/3i) } • 



(48) 



On the event £2, the L^o norm of the first term is bounded by 0{n^^^^UnXn^) = o(l) by the 
condition on A„. It remains to bound the second term of (|48p . 
A Taylor expansion of 7grr[c (S) around (3i componentwise gives 

7mig(3i) = 79Jtf,(/3i) + ^2^ (do) Xi(3i - /3i) + w, (49) 



where w = {ws+i, 



with Wj = ^{Pi — Pi)'^V^'jj{Sj){Pi — Pi) and dj some s-vector 



lying on the line segment joining Pi and Pi- By (jl7p in Condition [2] and Pi £ J\f, arguing 
similarly to (j43p . we have 



w 



= O [sn^-'^^ {log nf] . 
Since Pi solves equation "^{S) = in (jH]), we have 

Pi-Pi = [Xf S (0o) Xi] (^3^, - r, - r) . 
It follows from and ([16]) in Condition O (I3I|), (03]), and (|l8])-(l5l]) that 

||z|L < 0(1) + (nA„)-i ||7ajtg(3i) - 7OTg(/3i)L 
< 0(1) + (nA„)-^ Xl^E (0o) Xi [Xf S (0o) Xi] 



(50) 



(51) 



-1 



w 



< 0(1) + (nA„)~^0|n°i v^nlogn + sn^~27(iog^)2 + ^^^-^'^(log n)2| 
+ X^S(0o)Xi [Xf5](0o)Xi]"' p'(d„). 
The second term is of order 0(A^^n^"(log n)^) = o(l) by (fTS]) . Using (fTBl) . we have 

||z||oo<Cp'(0+) + o(l)<p'(0+) 

for sufficiently large n. 

Finally, note that condition ([9]) for sufficiently large n is guaranteed by A„ko = o{tq) in 
Condition [3l Therefore, by Theorem [T] we have shown that /3 = {Pi ,P2 is a strict local 
maximizer of the non-concave penalized likelihood Qn{P) ® with ||/3 — /3o||co = 0{n~^ \ogn) 
and P2 = ^ under the event £ir\ £2- These along with ([39]) prove parts a) and b). This 
completes the proof. 
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8.7 Proof of Theorem H 

We continue to adopt the notation in the proof of Theorem [2j To prove the conclusions, it 
suffices to show that under the given regularity conditions, there exists a strict local maxi- 
mizer (3 of the penalized likelihood QniP) in © such that 1) = with probability tending 
to 1 as n ^ oo (i.e., sparsity), and 2) \\(3i — Pi\\2 = Op{^/sJn) (i.e., ^/s/ n-consistency) . 

Step 1: Consistency in the s- dimensional subspace. We first constrain Qni(3) on the 
s-dimensional subspace {P G : /flgjig = 0} of R^. This constrained penalized likelihood is 
given by 

s 

Q^{S)=US)-Y,PxM\)^ (52) 
j=i 

where Jn{^) = n'~^['y'^^iS — l'^b(Xi5)] and 6 = {Si,-- - ,(5s)'^. We now show that there 
exists a strict local maximizer P-^ of Qn{^) such that 11/3^ — f3i\\2 = Op{^J s /n). To this end, 
we define an event 

Hn = \ QniPi) > max Q^{S) \ , 
I SedNr J 

where dNr denotes the boundary of the closed set Nr = {S G H'^ : \\d — Pi\\2 < \/s/nT} 
and r G (0, oo). Clearly, on the event Hn, there exists a local maximizer (3i of Qni^) in ^r- 
Thus, we need only to show that P{Hn) is close to 1 as n — > oo when r is large. To this end, 
we need to analyze the function Q„ on the boundary dNr- 

Let n be sufficiently large such that ^fsJriT < dn since dn ^ \/ sjn by Condition [5l It 
is easy to see that b = {b\, - - - , 6s)'^ € Nr entails sgn(5) = sgn(/3]^), \\S — /3i||oo < dn, and 
miuj \6j\ > dn. By Taylor's theorem, we have for any 6 £ N^, 

Qn{6) - QniPl) = {S- P,fv - 1(5 - (3,fBiS - (3,), (53) 

where v = n-^Xf [y - /i(0o)] - PxAPi), Oq = X/3o = ^iPi, 

D = n-^Xf S {6*) Xi + diag {p^ (|/3*l)} , 

9* = Xi/3*, and /3* lies on the line segment joining d and Pi- More generally, when the 
second derivative of the penalty function px does not necessarily exist, it is easy to show 
that the second part of the matrix D can be replaced by a diagonal matrix with maximum 
absolute element bounded by XnHQ. Recall that 

AAo = {beR^: ||b-/3i||oo < d„} 

and Ko = max^jgj^^ «;(/9; b), where K{p;h) is given by ([6]). For any S £ dNr, we have 
\\^ — Pi\\2 = \/s/nT and /3* G A/q. Then for sufficiently large n, by ([26l) and A„ko = o(l) in 
Conditions 0] and [5] we have 

Amm(D) > C - XnKo > 
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Thus by ([53]). we have 



max Qn{S) - QniPi) < Vs/nr ||v||2 - cy/s/nT/4: 

S&dNr 



which along with Markov's inequahty entails that 



PiHn) > P ||v||^ < 



,2 c^sr^ 



16n 



> 1 



l6nE llvllo 



It follows from Ey = /^(^o)) cov(y) = (j)'S{6Q), and Conditions U] and [S] that 
E ||v||^ = n-^E ||xf [y - (^o)]!!' + \\PxM)\\l 

since p'x^it) is decreasing in t e [0, cxd). Hence, we have 

P{Hn) > l-0(r-2). 

This proves — Pi\\2 = Op{^/ s/n). 

Step 2: Sparsity. Let (3 G with /Jgrjj^j = /3i G A'',- C A/q a strict local maximizer of 
Qn{S) and = Ptxn^ = 0, and = X/3. It remains to prove that the vector (3 is indeed 
a strict local maximizer of Qn{f3) on the space R^. From the proof of Theorem [H we see 
that it suffices to check condition The idea is the same as that in Step 2 of the proof of 
Theorem [2J Let ^ = {^i, ■ ' ' j Cp)'^ = ^"^y ~ ^"^/^(^o) ^-nd consider the event 



£2 



where u„, = c-^ ^/^n"/^^logn. We have shown in the proof of Theorem [2] that 

P{£2) >1-{P- sMun) > 1 - 2pe-'=i"" ^ 1, 
since logp = 0(n"). It follows from ^ and (l28|) in Condition d (gH]), (gl]) that 



(54) 



< ("-An) 



-1 



+ 



X^/x -X^/i(0o) 



o(l) + (nA„)-i Xl^5](0o)Xi(3i-/3i: 



+ w 



0(1) + (nA, 



0(n) 



+ 0(n) 



= o(l)+0(A^i7V^rj =0(1), 
which shows that inequality ([5]) holds for sufficiently large n. This concludes the proof. 
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8.8 Proof of Theorem H 



Clearly by Theorem [3l we only need to prove the asymptotic normality of f3i. On the event 
Hn defined in the proof of Theorem [3l it has been shown that f3i G Nr C A/q is a strict local 
maximizer of Q^i^) and = 0- It follows easily that VQ„(/3x) = 0. In view of ([52]) . we 
have 

VQJ5)=V4W-PA„(<5). 

We expand the first term V£„(5) around (3i to the first order componentwise. Then by ([28]) 
in Condition m and — /3i||2 = Op{^J s/n), we have under the L2 norm, 



(55) 



(56) 



= VQ„(/3i) = V4(/3i) - n-^Xf S (0o) - Pi) 

+ o(i)||3i-/3i|[vi-PA„(3i) 

= n-'XJ [y - fi{eo)] - n-ixf E (0o) Xi(3i " f^i) 

It follows from /3i G A/q, and p'^^{dn) = o(s"^/^n~^/^) in Condition [6] that 

Px„0i) 2 < VsPA„('^n) = op(l/v/^), 

due to the monotonicity of p'^^^ (t) . Combing ([55]) and ([56]) gives 

Xf S (Oo) Xi(3i - /3i) = Xf [y - /x(0o)] + op(^A^), 

since s = o(n^/^). This along with the first part of ()26p in Condition [4] entails 

(3i - Pi) = B-i/'Xf [y - /x(0o)] + op(l), 

where B„ = X^Xl(0o)Xi and the small order term is understood under the L2 norm. 

We are now ready to show the asymptotic normality of Pi. Let A^A^ — > G, where A^ 
is a g X s matrix and G is a symmetric positive definite matrix. It follows from (j57p that 



(57) 



A„By2 (^f3i-Pij =u„ + op(l), 
where u„ = A„Bn"^^'x^[y — /^(0o)]- Thus by Slutsky's lemma, to show that 



9 



it suffices to prove u„ — > N{0, (j)G). For any unit vector a G R"?, we consider the asymptotic 
distribution of the linear combination 



a^u„ = a^A„B-V2xf [y - /x(0o)] = ^i, 



i=l 
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where = a-^ A„B„/'^^^Zj [yj — 6'(^o,i)] and Xi = (zi, • • • ,z„)-^. Clearly ^j's are independent 
and have mean 0, and 
n 

5^ var(e.) = a^A„B-V20 [xfs (0o) Xi] B'^^^^^ 



6a'^A„A^a — > ^a'^Ga 



as n ^ cxD. By Condition [B] and the Cauchy-Schwarz inequality, we have 
^ E = |a^A„B-i/2z,| E \y, - b' (0o,)|' 

i=l i=l 

n 

= 0(1) |a^A„B-i/2zi 

n 

<0(1) j;||a^A„||2 



B-i/2 



1=1 



i=l 



Therefore an application of Lyapunov's theorem yields 



a^u„ = Y^i^ N{0, 0a^Ga). 



i=l 



Since this asymptotic normality holds for any unit vector a G R'', we conclude that 
N{0,(j)G), which completes the proof. 



A Appendix 

A.l Three commonly used GLMs 

In this section we give the formulas used in the ICA algorithm for three commonly used 
GLMs: linear regression model, logistic regression model, and Poisson regression model. 

Linear regression. For this model, b{9) = ^0"^, ^ G R and (p = a"^ . The penalized 
likelihood Qnif3) in ([3]) can be written as 



p 

Qn{f3) = - { (2n)-i ||y - X(3\\l + Vpa 



12 ^ Z^J 

i=i 



j\J ( ' 



(58) 



where /3 = (/3i,--- , f3p)'^ ■ Thus maximizing Qn{P) becomes the penalized least squares 
problem. In Step 3 of ICA, we have QniPj', (3 ,j) = Qn{(3), where the subvector of /3 with 
components in {1, • • • \ {j} is identical to that of /3 
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Logistic regression. For this model, b{9) = log(l + e^), 6* G R and = 1. In Step 3 of 
ICA, by jMl) we have 

Qn(/5,;3^', j) = 4(3^') + {xj [y - M (x3^'=)] } - ^^^) (59) 
2n 



xJS (X3 x,J (/?, - - ^pA, m) 



where the subvector of f3 with components in {!,••• ,p} \ {j} is identical to that of /3 
3"' = (^^, • • • J^-f , X = (xi, • . . ,xp), /X(X3'') = • • • , ^)^, and 

E(X/3 = diag ' ' 



(l + e^i)^' '(l + e^-)^ 



mm ■ 

/3GR 



with (01,- •• ,0^)' =X/3 

Poisson regression. For this model, 6(0) = e^, € R and cf) = 1. In Step 3 of ICA, 
Qnif3j',P ^ ,j) has the same expression as in (j59|) with 

/x(x3^') = (e^'i,--- ,e^")^ and S(x3^''') = diag je^^ , • • • , e^" } , 
where (0i,--- ,0^)^ = x3^'. 

A. 2 SCAD penalized least squares solution 

Consider the univariate SCAD penalized least squares problem 

.{2-i(z-/3)2 + ApA(|/?|)}, (60) 

where z G R, A G (0,co), and p\ is the SCAD penalty given by The solution when 
A = 1 was given by Fan (1997). We denote by R{P) the objective function and P{z) the 
minimizer of problem (|60p . Clearly (3{z) equals or solves the gradient equation 

g{(3) ^ Vfs {2-1 {z - (if + ApA m)} =f3-z + sgn{P)ApM = 0. (61) 

It is easy to show that P{z) = sgn{z)\P{z)\ and \P{z)\ < \z\, i.e., (3{z) is between and z. 
Let zo = sgn(z)(|z| — AA)+. 

1) If \z\ < A, we can easily show that l3{z) = zq. 

2) Let A < l^l < aX. Note that g defined in (j6ip is piecewise linear between and z, and 
g{0) =sgn(z)[-|z| + AA], g{sgn{z)\) = sgn(2;)[-|z| + (A + 1)A], g{z) = sgn{z)Ap'^{\z\)]. Thus 
it is easy to see that if \z\ < (A + 1)A, we have f3{z) = zq, and if \z\ > (A + 1)A, we have 

3) Let \z\ > aX. The same argument as in 2) shows that when \z\ < (A + 1)A, we have 
P{z) = Zo if R{zo) < R{z) and (3{z) = z otherwise. When \z\ > (A + 1)A, we have P{z) = z. 
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